Abstract: This note reassesses the dual nature of the Jeffreys-Lindley's
paradox and its considerable impact on both classical and Bayesian
statistics, as well as on existing resolutions of the paradox. It also examines
a recent and critical viewpoint on the paradox by Spanos (2013).

1. Understanding the paradox setting

Maybe paradoxically, my own understanding of the Jeffreys-Lindley's paradox 
has always been that it pointed at the poor and even unacceptable behaviour 
of vague prior distributions when testing point-null hypotheses. For instance, 
my own attempt at solving the paradox (Robert, 1993) was definitely written 
under this understanding and aimed at suppressing the impact of an arbitrary 
normalising constant in improper priors. It is only very recently that I became 
aware that most people (Dennis Lindley included) understand the paradox as 
an irreconcilable divergence between the Bayesian (B) and the frequentist (f)
resolutions of the point-null hypothesis testing problem, blaming one of those for
the discrepancy. (It has been reasonably argued that there is no such thing as
one Bayesian resolution or one frequentist resolution. While I agree in principle
with this view, I will nonetheless restrict the discussion below to the opposition 
between the p-value and the posterior probability — or equivalently the Bayes 
factor, see e.g. Kass and Wasserman, 1996.) 

I must acknowledge being rather surprised at this common focus as I see
no reason why both approaches should agree: (a) one (B) is operating on the
parameter space $\Theta$, while the other (f) is produced on the sample space
$\mathcal{X}$, or, in other words, one (B) is dealing with credibility while the
other (f) dabbles in confidence; (b) one (f) relies solely on the point-null
hypothesis $H_0$ and the corresponding distribution, while the other (B) opposes
$H_0$ to a marginal version of $H_1$ (integrated over the parameter space $\Theta$
against a specific prior distribution); (c) following what may be the most famous
quote from Jeffreys (1939, Section 7.2), one (f) could reject "a hypothesis that
may be true (...) because it has not predicted observable results that have not
occurred" ($\{X > x^{\text{obs}}\}$, say), while the other (B) conditions upon the
observed value $x^{\text{obs}}$; (d) one (f) resorts to an arbitrary fixed bound
$\alpha$ on the p-value, while the other (B) refers to the boundary probability
of 1/2 (unless a genuine loss function is constructed). A sizeable
literature (see, e.g., Berger and Sellke, 1987) has since then shown how divergent
those two approaches could be (to the point of being asymptotically incompatible).

While the gap between frequentist and Bayesian degrees of evidence was certainly
the reason for Lindley (1957) mentioning a statistical paradox, I thus
remain convinced that the richest consequence of Jeffreys's (1939) and Lindley's
(1957) exhibitions of this paradox is to highlight the genuine difficulty in
using improper or very vague priors in testing settings: as stressed by Lindley
(1957), "the only assumption that will be questioned is the assignment of a
prior distribution of any type" (p. 188). These were also the arguments made by
both Shafer (1982) and DeGroot (1982) (see also DeGroot, 1973) in their discussion
of the paradox. Note that Jeffreys does not address the general problem
of using improper priors in testing, using ad-hoc solutions when available and 
developing a second (and under-appreciated) type of Jeffreys's priors otherwise 
(see Robert et al., 2009, Section 6.4, for a discussion). 

The plan of this note is as follows: it reviews the paradox in Section 2, analyses 
the recent criticism by Spanos (2013) in Section 3, discusses the Bayesian aspects
of the paradox in Section 4, and concludes in Section 5. 

2. The paradox, paradoxes, or non-paradox 

Let us first recall the setting of Lindley (1957). If one considers a normal
mean testing problem,

$$\bar{x}_n \sim \mathcal{N}(\theta, \sigma^2/n), \qquad H_0: \theta = \theta_0,$$

using Jeffreys's (1939) choice of prior, $\theta \sim \mathcal{N}(\theta_0, \sigma^2)$, leads to the Bayes factor

$$\mathfrak{B}_{01}(t_n) = (1 + n)^{1/2} \exp\left(-n t_n^2 / 2[1 + n]\right),$$

where $t_n = \sqrt{n}\,|\bar{x}_n - \theta_0|/\sigma$ is the classical t-test statistic.

The first level of the paradox is that, when $t_n$ is fixed and $n$ goes to infinity, the
Bayes factor goes to infinity while the p-value remains constant. In Lindley's
words, "we [can be] 95% confident that $\theta \neq \theta_0$ but have 95% belief that $\theta = \theta_0$"
(p. 187). As discussed previously in the literature, this is not a mathematical
paradox as the quantities measure different objects (the probability measure of
an event over the sample space versus the probability measure of an event over
the parameter space, the former being conditional on the parameter value and
the latter on the observation of the sample) and this is not a statistical paradox
in that observing a constant¹ $t_n$ as $n$ increases is not of interest: when $H_0$ is
true, $t_n$ has a limiting $\mathcal{N}(0, 1)$ distribution, while, when $H_0$ does not hold, $t_n$
converges almost surely to $\infty$, in which case the Bayes factor converges to 0.
This behaviour is thus entirely compatible with the result of the consistency of
the Bayes factor in this setting.²
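
As a numerical illustration of this first level, the following minimal Python sketch evaluates the Bayes factor $(1+n)^{1/2}\exp(-n t_n^2/2[1+n])$ for a fixed value of the statistic and a growing sample size, next to the two-sided normal p-value. The function names, the choice $t_n = 1.96$ and the equal prior weights on both hypotheses are assumptions made for the illustration only.

```python
import math


def bayes_factor_01(t_n: float, n: int) -> float:
    """Bayes factor of H0 against H1: sqrt(1 + n) * exp(-n t_n^2 / (2 (1 + n)))."""
    return math.sqrt(1 + n) * math.exp(-n * t_n**2 / (2 * (1 + n)))


def two_sided_p_value(t_n: float) -> float:
    """P(|Z| >= t_n) for Z ~ N(0, 1); it does not involve n once t_n is fixed."""
    return math.erfc(abs(t_n) / math.sqrt(2))


def posterior_prob_h0(t_n: float, n: int, prior_h0: float = 0.5) -> float:
    """Posterior probability of H0 (prior weight prior_h0 is an assumption)."""
    odds = prior_h0 / (1 - prior_h0) * bayes_factor_01(t_n, n)
    return odds / (1 + odds)


if __name__ == "__main__":
    t_fixed = 1.96  # held constant while n grows
    for n in (10, 1_000, 100_000, 10_000_000):
        print(
            f"n={n:>10d}  p-value={two_sided_p_value(t_fixed):.4f}  "
            f"B01={bayes_factor_01(t_fixed, n):10.3f}  "
            f"P(H0|x)={posterior_prob_h0(t_fixed, n):.4f}"
        )
```

The p-value column stays put while the Bayes factor grows at the rate $\sqrt{n}$, which is all there is to the first level of the paradox.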

¹ As pointed out by Lindley (1957): "5% in to-day's small sample does not mean the same
as 5% in to-morrow's large one" (p. 189).

² One could almost argue that the true paradox is that this consistency is overlooked in
most commentaries on the Jeffreys-Lindley's paradox.
At a second level of interpretation for the above setting, if we shift the meaning
of $n$ from being a sample size to being a prior scale factor, namely if we set
that the prior variance is $n$ times larger than the observation variance (or that
the prior is $n$ times less precise),³ the result derived from the above expression
is that when the scale $n$ goes to infinity, the Bayes factor goes to infinity no
matter what the value of the observation is. (Note that both interpretations are
mathematically equivalent.) Now, under this new light, $n$ becomes what Lindley
(1957) calls "a measure of lack of conviction about the null hypothesis" (p. 189),
a sentence that I re-interpret as the prior (under $H_1$) getting more and more
diffuse as $n$ grows. I must however stress that nowhere in the paper is the difficulty
with improper (or very large variance) priors discussed.
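
To make this second reading concrete, here is a small Python sketch for a single observation $x \sim \mathcal{N}(\theta, \sigma^2)$ with prior $\theta \sim \mathcal{N}(\theta_0, s\sigma^2)$ under the alternative, where the scale $s$ plays the role of $n$ in the Bayes factor expression. The function name and the numerical values (an observation four standard deviations away from the null) are hypothetical choices for the illustration.

```python
import math


def bayes_factor_01_prior_scale(x: float, theta0: float, sigma: float,
                                prior_scale: float) -> float:
    """Bayes factor of H0: theta = theta0 for one observation x ~ N(theta, sigma^2),
    with theta ~ N(theta0, prior_scale * sigma^2) under H1 (same algebra as the
    sample-size version, with prior_scale in place of n)."""
    t = abs(x - theta0) / sigma
    s = prior_scale
    return math.sqrt(1 + s) * math.exp(-s * t**2 / (2 * (1 + s)))


if __name__ == "__main__":
    # Hypothetical observation, four standard deviations away from theta0.
    for s in (1.0, 15.0, 1e4, 1e8, 1e12):
        print(f"prior scale={s:>8.0e}  B01={bayes_factor_01_prior_scale(4.0, 0.0, 1.0, s):.4g}")
```

For moderate scales the Bayes factor first decreases (the data speak against the null), but past $s = t^2 - 1$ it increases without bound, so that a sufficiently diffuse prior ends up supporting the null whatever the observation.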

In this perspective, I also consider that the phenomenon still is not a paradox
per se: when the diffuseness of the (alternative) prior (i.e., under $H_1$) increases,
the only relevant piece of information becomes that $\theta$ could be equal to $\theta_0$, to the
extent that it overwhelms any evidence to the contrary contained in the data.
For one thing, and as put by Lindley (1957), "the value $\theta_0$ is fundamentally
different from any value of $\theta \neq \theta_0$, however near $\theta_0$ it might be" (p. 189).⁴
For another thing, the mass of the prior distribution on any fixed
neighbourhood of the null hypothesis and even on any set coherent with the data
at hand vanishes to zero. There is therefore a deep coherence in the selection
of the null hypothesis $H_0$ in this case: being completely indecisive about the
alternative hypothesis means we simply should not choose it. It is impossible
to pick the alternative hypothesis against the very special value $\theta_0$ if we want
to be "completely non-informative" about $\theta$ under $H_1$. Depending on one's
perspective about Bayesian statistics, one might see this as a strength or as a
weakness since Bayes factors and posterior probabilities do require a realistic
model under the alternative when p-values and Bayesian predictives do not.
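
The vanishing prior mass can be checked directly in the normal setting: under $\theta \sim \mathcal{N}(\theta_0, s\sigma^2)$, the prior probability of a fixed neighbourhood $\{|\theta - \theta_0| < \delta\}$ is $\mathrm{erf}(\delta/\sigma\sqrt{2s})$. The sketch below is only an illustration, with a hypothetical function name and an arbitrary $\delta$.

```python
import math


def prior_mass_near_null(delta: float, sigma: float, prior_scale: float) -> float:
    """P(|theta - theta0| < delta) under theta ~ N(theta0, prior_scale * sigma^2)."""
    return math.erf(delta / (sigma * math.sqrt(2.0 * prior_scale)))


if __name__ == "__main__":
    for s in (1.0, 1e2, 1e4, 1e8):
        print(f"prior scale={s:>8.0e}  mass within one sigma of theta0={prior_mass_near_null(1.0, 1.0, s):.2e}")
```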

3. Don't be afraid... 

Under the provocative⁵ title of "Who should be afraid of Lindley's paradox",⁶
Spanos (2013) offers his frequentist reassessment of the paradox, arguing against 
both Bayesian and likelihood ratio approaches and in favour of the postdata 
severity evaluation he and Mayo have both been advocating since 2004. 

³ Or yet that, in terms of de Finetti's imaginary observations, the prior corresponds to the
information brought by one single imaginary observation, as opposed to $n$ real observations.

⁴ We will get back to this fundamental remark in the discussion of Spanos (2013) in the
next section.

⁵ Although the overall style of the paper is quite antagonistic, I will not produce here
evidence towards the rhetorical devices used therein, concentrating on the statistical aspects
and on their bearings on a re-analysis of the foundations of our field.

⁶ Given the contents of the paper, the author presumably intends Bayesian statistics or
Bayesians as the recipient of this question.

First, let me stress that the notion of evidence is never defined throughout
the paper, even though it is repeatedly mentioned therein. My experience is that
the notion widely fluctuates according to its user, ranging from vague facts to
specific mathematical constructs (see, e.g., Skilling, 2006). Neither is the specific
purpose of conducting a test (against, say, constructing a confidence interval)
discussed at all. Reading the discourse of Spanos (2013) makes it sound as
though there were an obvious truth ($H_0$ or $H_1$) and as though one and only
one statistical approach could reach it, despite the evidence (!) to the contrary
brought by the consistency of the three approaches in Lindley's (1957) setting.⁷
Indeed, what differentiates tests from other aspects of inference is that (a) there
is a question being asked about the statistical model under study and (b) the
answer to this question will impact the subsequent actions of the individual
who asked the question. Point (a) relates to Lindley's (1957) stress on the fact
that $\theta_0$ is very special indeed and quite different from any neighbouring value:
it was selected for a reason and with a motive, brought forward by a theoretical
construct rather than inspired from the data. From a Bayesian perspective,
this implies prior information is available as to why $\theta_0$ is a special value of
the parameter $\theta$. Point (b) is about assessing the consequences of the answer
to the question, especially the wrong answer. Both from a frequentist and
from a Bayesian perspective, this implies defining a loss or utility function that
quantifies the impact of a wrong answer and eventually determines the boundary
between acceptance and rejection.⁸ Unfortunately, the remark "the problem
does not lie with the p-value or the accept/reject rules as such, but with how
such results are transformed into evidence for or against $H_0$ or a particular
alternative" (p. 76) does not proceed into a decisional step but instead into
the introduction of a secondary p-value bound, the severity evaluation, coupled
with a parameter value that requires a distance from the null and in fine an
implicit loss function determining what is far and what is not. For instance,
when Spanos (2013, p. 75) states that "there is nothing fallacious or paradoxical
about a small p-value or a rejection of the null, for a given significance level $\alpha$,
when n is large enough, since a highly sensitive test is likely to pick up on tiny (in
a substantive sense) discrepancies from $H_0$", the "substantive sense" can only
be gathered from a loss function. The conclusion that "what goes wrong is that
the Bayesian factor and the likelihoodist procedures use Euclidean geometry
to evaluate evidence for different hypotheses when in fact the statistical testing
space is curved" (p. 90) is mathematically meaningless when considering that the
Bayes factor is invariant under one-to-one reparameterisation, hence impervious
to the curvature of both the parameter and the sampling spaces.

Second, Spanos (2013) argues that the Jeffreys-Lindley paradox is an argument
against the Bayesian (and likelihood) resolutions of the problem for
failing to account for the large sample size.⁹ I do not disagree with this
perspective to the extent that I consider that the most important lesson learned
from Lindley (1957) is that vague priors require special caution when conducting
point-null hypothesis testing. There seems indeed to be little sense in arguing
in favour of a procedure that would always conclude by picking the null, no
matter what the value of the test statistic is. However, as already stressed
above, considering a fixed value of the t statistic has little meaning
in an asymptotic framework, i.e. when $n$ increases to $\infty$. Either the t statistic
converges in distribution to the t distribution under the null hypothesis $H_0$ or
it diverges to infinity under the alternative $H_1$. This is the reason why both the
Bayesian and the likelihood ratio approaches are consistent in this setting.

⁷ Ironically, the numerical example used in the paper (borrowed from Stone, 1997, also
father to the marginalisation paradoxes, see Dawid et al., 1973) is the very same as Bayes's
billiard example (if with a larger value of $n$) and as Laplace's example on births (with a similar
value of $n$).

⁸ This is the simplest type of loss function: more advanced versions could include the case
of a non-decision, calling for more observations, as in Berger (2003).

⁹ The argument about the invariance of the Bayes factor to $n$ (p. 84) is found wanting as
the Bayes factor does depend on $n$, as exhibited by $\mathfrak{B}_{01}(t_n)$ above.
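
The consistency argument can be illustrated by simulation in the normal mean setting of Section 2. The sketch below is a hedged illustration rather than a proof: it draws repeated samples either at $\theta_0$ or at a fixed alternative and records how often the Bayes factor $(1+n)^{1/2}\exp(-n t_n^2/2[1+n])$ exceeds one. The helper names, the alternative value 0.3, the sample sizes and the seed are all arbitrary assumptions.

```python
import math
import random


def log_bayes_factor_01(t_n: float, n: int) -> float:
    """log of sqrt(1 + n) * exp(-n t_n^2 / (2 (1 + n)))."""
    return 0.5 * math.log1p(n) - n * t_n**2 / (2 * (1 + n))


def t_statistic(sample, theta0: float, sigma: float) -> float:
    """t_n = sqrt(n) |mean - theta0| / sigma, with sigma treated as known."""
    n = len(sample)
    return math.sqrt(n) * abs(sum(sample) / n - theta0) / sigma


def share_favouring_null(theta_true: float, theta0: float, sigma: float,
                         n: int, replications: int, rng: random.Random) -> float:
    """Fraction of simulated samples of size n for which B01 > 1."""
    favour = 0
    for _ in range(replications):
        sample = [rng.gauss(theta_true, sigma) for _ in range(n)]
        if log_bayes_factor_01(t_statistic(sample, theta0, sigma), n) > 0:
            favour += 1
    return favour / replications


if __name__ == "__main__":
    rng = random.Random(2013)  # arbitrary seed
    for n in (10, 100, 1_000):
        under_null = share_favouring_null(0.0, 0.0, 1.0, n, 200, rng)
        under_alt = share_favouring_null(0.3, 0.0, 1.0, n, 200, rng)
        print(f"n={n:>5d}  B01>1 under H0: {under_null:.2f}  under theta=0.3: {under_alt:.2f}")
```

As $n$ grows, the share of samples favouring the null should approach one when the data come from $\theta_0$ and zero when they come from the alternative, which is what consistency means here; a fixed $t_n$ is typical of neither regime.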

In a global pondering about hypothesis testing, I would actually argue that
the Jeffreys-Lindley paradox expresses difficulties for all of the three
methodological threads: when following Fisher's approach, there is a theoretical and
practical difficulty as to how one should decrease the acceptance bound $\alpha = \alpha(n)$
on the p-value when $n$ increases. It fails to provide a principle on which this
bound (or sequence of bounds) $\alpha(n)$ should be chosen. For instance, the paper
mentions (p. 78) that because "of the large sample size, it is often judicious to
choose a small type I error, say $\alpha = .003$" but this sentence simply points at
the arbitrariness of this numerical value. Or, worse, that it was dictated by the
data since the observed p-value takes the nearby .0027 value. In addition, I find
the argument of consistency unconvincing in that case since both the Bayes factor
and the likelihood ratio tests are then consistent testing procedures. In the
Neyman-Pearson framework, I have difficulty finding a proper balance or
imbalance between Type I and Type II errors, since such balance is not provided
by the theory, which settles for the sub-optimal selection of a fixed Type
I error. In addition, I have trouble with the whole notion of power, due to the
fact that it is a function that depends on the unknown parameter. In particular,
the power decreases to the Type I error at the boundary of the parameter set
between the null and the alternative hypotheses. Without a prior distribution,
giving a meaning to something like (eqn. (25), p. 87)

$$\mathbb{P}(x : d(X) \le d(x_0);\ \theta > \theta_1 \text{ is false})$$

seems impossible. As discussed further in other sections of this note, apart
from the genuine difficulty in setting a prior distribution, following a standard
Bayesian approach with a flat prior on the binomial probability inferred about
in Spanos (2013) leads to a Bayes factor of 8.115 (p. 80). Since this is neither a
huge nor a tiny quantity per se, the very difficulty is in calibrating it, Jeffreys's
(1939, Appendix) scale being highly formal.
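
For readers who wish to check this order of magnitude, the Bayes factor of a point null $\theta = \theta_0$ against a flat prior on a binomial probability is $(n+1)$ times the binomial probability of the observed count at $\theta_0$, since the marginal likelihood under the uniform prior equals $1/(n+1)$. The sketch below is an illustration under loud assumptions: the counts $n = 527{,}135$ and $k = 106{,}298$ are recalled from the example in Spanos (2013) and are not stated in this note, so they should be verified against that paper; the function names are hypothetical.

```python
import math


def log_binomial_pmf(k: int, n: int, theta: float) -> float:
    """log of C(n, k) theta^k (1 - theta)^(n - k), computed with lgamma for large n."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(theta) + (n - k) * math.log1p(-theta))


def bayes_factor_flat_prior(k: int, n: int, theta0: float) -> float:
    """B01 for H0: theta = theta0 against a uniform prior on (0, 1).

    Under the uniform prior the marginal probability of any count k is 1 / (n + 1),
    hence B01 = (n + 1) * Binomial(k; n, theta0).
    """
    return math.exp(log_binomial_pmf(k, n, theta0) + math.log(n + 1))


if __name__ == "__main__":
    # ASSUMPTION: data recalled from Spanos (2013), not given in the present note.
    n_obs, k_obs, theta_null = 527_135, 106_298, 0.2
    print(f"B01 = {bayes_factor_flat_prior(k_obs, n_obs, theta_null):.3f}")
```

If the recalled counts are right, the output should land close to the value of 8.115 quoted above, a figure that favours the null while the corresponding p-value of about .0027 rejects it.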

¹⁰ In connection with this point, I fail to understand why a Bayes factor would "ignore
the sampling distribution (...) by invoking the likelihood principle" (p. 90): the Bayes factor
incorporates the sampling distribution by integrating out against the associated prior under
the alternative hypothesis.

¹¹ After an exchange with D. Mayo (2013, personal communication), it appears that this
probability is computed under the distribution of $X$ associated with the parameter $\theta_1$.

Third, Spanos (2013) uses the failures (or fallacies?) of all three main approaches
to address the difficulties with the Jeffreys-Lindley paradox to advocate
his own criterion, the "postdata severity evaluation" introduced in an
earlier paper with Deborah Mayo (Mayo and Spanos, 2004).¹² The notion of severe
tests has been advocated by Mayo and Spanos over the past years, but it has
not yet had any impact on the practice of statistics: in my opinion, the solution
seems to require even further calibration than the regular p-value and it is thus
bound to confuse practitioners. Indeed, the severity evaluation as explained¹³
in Spanos (2013) implies defining, for each departure from the null, $\theta_1 = \theta_0 + \gamma$,
the probability that a dataset associated with this parameter value "accords
less with $\theta > \theta_1$ than $x_0$ does" (p. 87). (Note that the two-sided alternative has
been turned postdatum into a one-sided version.) This notion is therefore a mix
of p-value and of type II error that is supposed to "provide the 'magnitude'
of the warranted discrepancy from the null" (p. 88), i.e. to decide about how
close (in distance) to the null we can get and still be able to discriminate the
null from the alternative hypotheses. As discussed in the paper, the value of
this closest discrepancy $\gamma$ (which is thus a bound on when we can discriminate
between $H_0$ and $H_1$ at a given sample size) does depend on another arbitrary
tail probability, the "severity threshold",

$$\mathbb{P}_{\theta_1}\{d(X) \le d(x_0)\},$$

since this probability has to be chosen by the experimenter without being more
intuitive than the initial acceptance bound on the p-value.¹⁵ Further, once the
resulting discrepancy $\gamma$ is found, whether it is far enough from the null is a
matter of informed opinion since, as duly noted by Spanos (2013), whether it
is "substantially significant (...) pertains to the substantive subject matter"
(p. 88), implying once more some sort of loss function that is ignored (or implicit)
throughout the paper.
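
The dependence of the discrepancy $\gamma$ on the severity threshold can be sketched in the simplest case. The code below assumes a one-sided normal mean problem with known $\sigma$ and $d(X) = \sqrt{n}(\bar{X} - \theta_0)/\sigma$, so that $d(X) \sim \mathcal{N}(\sqrt{n}\gamma/\sigma, 1)$ under $\theta_1 = \theta_0 + \gamma$; this is a simplification chosen for the illustration (the example in Spanos, 2013, is binomial), and the function names and numerical values are hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, for cdf and quantile


def severity(d_obs: float, gamma: float, sigma: float, n: int) -> float:
    """P_{theta1}{d(X) <= d(x0)} for theta1 = theta0 + gamma in the normal model."""
    return STD_NORMAL.cdf(d_obs - math.sqrt(n) * gamma / sigma)


def largest_warranted_discrepancy(d_obs: float, sigma: float, n: int,
                                  threshold: float) -> float:
    """Largest gamma whose severity is at least `threshold` (obtained by inverting the cdf)."""
    return (d_obs - STD_NORMAL.inv_cdf(threshold)) * sigma / math.sqrt(n)


if __name__ == "__main__":
    d_obs, sigma, n = 3.0, 1.0, 100  # hypothetical observed statistic and design
    for threshold in (0.80, 0.90, 0.95, 0.99):
        gamma = largest_warranted_discrepancy(d_obs, sigma, n, threshold)
        print(f"threshold={threshold:.2f}  gamma={gamma:.4f}  "
              f"severity check={severity(d_obs, gamma, sigma, n):.2f}")
```

The reported $\gamma$ moves with the chosen threshold, which is the point made above: the severity evaluation trades one tail probability to be fixed by the experimenter for another.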

¹² Section 6 starts with the mathematically puzzling argument that, since we have observed
$x_0$, the sign of $x_0 - \theta_0$ "indicates the relevant direction of departure from $H_0$". First, random
variables may take values on both sides of $\theta_0$ for most values of $\theta$. Second, the fact that one is
testing $H_0$ against a two-sided or a one-sided alternative hypothesis pertains to the motivation
of the test, not to the direction suggested by the data. The contentious modification of the
testing setting once the data is observed is an issue with Spanos' (2013) perspective that we
will discuss further.

¹³ Let me remark that typos in both the last line of p. 87, which is mixing the standardised
and the non-standardised versions of the test statistic, and Table 1, which introduces a
superfluous minus sign, do not help in clarifying the issue.

¹⁴ When considering the severity as a function of $\theta_1$, complement to a probability cdf in
$\theta_1$, the most natural interpretation would be Bayesian, the bound being a quantile. However,
this solution is quite unlikely to meet with the authors' approval.

¹⁵ While this is very unlikely to be advocated either by the author or by Bayesian
statisticians, we note that, as statistics, i.e. transforms of the data, both the Bayes factor and
the likelihood ratio could be processed in exactly the same way to produce severity thresholds
of their own.

In connection with the special meaning of the value $\theta_0$, several parts of
Spanos' discussion of the Bayesian approach argue (see, e.g., p. 81) about other
values of $\theta$ that are supported and even better supported by the data than the
null value $\theta_0$. This is a surprising argument as it pertains to the construction of
Bayesian credible intervals but not to testing. While it is correct that the observed
data $x_0$ does "favor certain values more strongly" (p. 81) than $\theta_0$, those
values are (a) driven by the data, i.e. will vary from one repetition of the experiment
to the next, and (b) of no particular relevance for conducting a test,
meaning that the experimenter or the scientist behind the experiment had not 
expressed a particular interest in those values before they were exposed by the 
data. The tested value, $\theta_0 = 0.2$ say, is chosen prior to the experiment because it
has some special meaning for the problem at hand. The fact that the likelihood
and the posterior are larger at other values of $\theta$ does not "constitute conflicting
evidence" against the fact that the null hypothesis holds. Or does not hold. It
simply reflects the fact that the likelihood function is a random function of
the parameter $\theta$, whose mode also varies with the data and is almost surely not
located at the true value of the parameter. Even under the null. 

4. On some resolutions of the Bayesian version 

While the divergence between the frequentist and Bayesian answers reflects
the difference between the paradigms in terms of purpose and evaluation,
the (Bayesian) debate about constructing limiting Bayes factors or posterior
probabilities that include improper prior modelling stands both open and relevant.
DeGroot's (1982) warning that "diffuse prior distributions (...) must be
used with care" has now been impressed upon generations of students and it
is indeed a fair warning. There remains nonetheless a crucial need to produce
assessments of null hypotheses from a Bayesian perspective and under limited
prior information, once again without any incentive whatsoever to mimic, reproduce
or even come close to frequentist solutions like p-values. (I will therefore
abstain from covering here the notion of matching priors, whose sole purpose
is to bring frequentist and Bayesian coverages as close as possible, see e.g.
Datta and Mukerjee, 2004.)

In Robert (1993), I suggested selecting the prior weights of the two hypotheses,
($\varrho_0$, $1 - \varrho_0$), in order to compensate for the increased mass brought by the
alternative hypothesis prior. While the solution therein produced numerical
results that brought a proximity with the p-value, its construction is flawed
from a measure-theoretic point of view since the determination of the weights
involves the value of the prior density $\pi_1$ at the point-null value $\theta_0$,

$$\varrho_0 = (1 - \varrho_0)\,\pi_1(\theta_0),$$

a difficulty also shared by the (related) Savage-Dickey paradox (Robert and Marin,
2009).¹⁷ I nonetheless remain of the opinion that the degree of freedom represented
by the prior weight $\varrho_0$ in the Bayesian formalism should not be neglected
to overcome the difficulty in using improper priors.¹⁸
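
To see how a weight tied to the prior density at the null can offset a diffuse prior, consider the normal setting of Section 2 with prior variance $s\sigma^2$ under the alternative. If the prior odds of the null are set to $\pi_1(\theta_0)$, the posterior odds become $\pi_1(\theta_0)$ times the Bayes factor, and the $\sqrt{s}$ growth of the latter is cancelled by the $1/\sqrt{s}$ decay of the former. The Python sketch below checks this numerically; it is an illustration of the mechanism under these assumptions, not a reproduction of the construction in Robert (1993), and all names are hypothetical.

```python
import math


def prior_density_at_null(sigma: float, prior_scale: float) -> float:
    """Density of N(theta0, prior_scale * sigma^2) evaluated at theta0."""
    return 1.0 / (sigma * math.sqrt(2.0 * math.pi * prior_scale))


def bayes_factor_01(t_n: float, n_obs: int, prior_scale: float) -> float:
    """B01 for the mean of n_obs observations and prior variance prior_scale * sigma^2."""
    r = n_obs * prior_scale
    return math.sqrt(1 + r) * math.exp(-r * t_n**2 / (2 * (1 + r)))


def compensated_posterior_odds(t_n: float, n_obs: int, sigma: float,
                               prior_scale: float) -> float:
    """Posterior odds of H0 when the prior odds are set to pi_1(theta0)."""
    return prior_density_at_null(sigma, prior_scale) * bayes_factor_01(t_n, n_obs, prior_scale)


if __name__ == "__main__":
    t_n, n_obs, sigma = 1.96, 25, 1.0  # hypothetical data summary
    for s in (1.0, 1e2, 1e4, 1e10):
        print(f"prior scale={s:>8.0e}  B01={bayes_factor_01(t_n, n_obs, s):12.3f}  "
              f"compensated odds={compensated_posterior_odds(t_n, n_obs, sigma, s):.5f}")
```

The compensated odds settle at $\sqrt{n}\,\exp(-t_n^2/2)/\sigma\sqrt{2\pi}$, the sampling density of $\bar{x}_n$ under the null, a quantity that depends on the unit of measurement and thus reflects the measure-theoretic flaw noted above.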

¹⁶ The compensation cannot be probabilistic in that the overall mass of an improper prior
will remain infinite.

¹⁷ A solution to the measure-theoretic difficulty is to impose a version of $\pi_1$ that is
continuous at $\theta_0$ so that $\pi_1(\theta_0)$ is uniquely defined. It however equates the values of two density
functions under two orthogonal measures.

¹⁸ Some will object to this choice on Bayesian grounds as it implies that the prior does
depend on the sample size $n$.
Another direction worth pursuing is Berger et al.'s (1998) partial validation
of the use of identical improper priors on the nuisance parameters, a notion
already entertained by Jeffreys (1939, see the discussion in Robert et al., 2009,
Section 6.3). While arguing about the "same" constant in both models towards
using the "same" improper prior for both models has no mathematical nor
statistical validation, using the same prior eliminates quite conveniently the
major thorn in the side of Bayesian testing of hypotheses. As demonstrated in
Marin and Robert (2007) and Celeux et al. (2012), it allows in particular for the
use of a partly improper g-prior in linear and generalised linear models (Zellner,
1986).

Yet another resolution to the paradox is apparently found in DeGroot's (1982)
recommendation to keep "in mind that the assignment of a prior distribution to
the parameter $\theta$ induces a predictive distribution for the observation" (p. 337),
as comparing predictives allows for an assessment of Bayesian models (meaning
that either the sampling or the prior distribution may be inadequate). However,
I think Morrie DeGroot meant in this text using the prior predictive,
in which case this approach is essentially equivalent to the Bayes factor, hence
does not solve the improperness issue, and suffers from the same calibration
difficulty. If, instead, one considers the posterior predictive, this is the solution
advocated in, among others, Gelman et al. (2003), under the name of posterior
predictive checking, but it implies using the data twice (once for building the
posterior and once for deriving the assessment), and has been reinterpreted by
Aitkin (1991, 2010) in his integrated likelihood theory, drawing strong criticism
from many, including Dennis Lindley's now famous "One hardly advances the
respect with which statisticians are held in society by making such declarations"
(1991, p. 131). (See also Gelman et al., 2013.)²⁰

A last direction worth investigating is the recent development of the use of
score functions $S(x, m)$ that extend the log score function associated with the
Bayes factor:

$$\log \mathfrak{B}_{12}(x) = \log m_1(x) - \log m_2(x) = S_0(x, m_1) - S_0(x, m_2),$$

where $m_i$ is the prior predictive associated with model $\mathfrak{M}_i$. Indeed, there exists a
whole family of proper scoring rules that are independent from the normalising
constant of the prior predictive (Parry et al., 2012) and can thus be used on
improper priors as well. For instance, Hyvärinen's (2005) score is one of these
scores. While the scores are delicate to calibrate, i.e. the magnitude of
$S(x, m_1) - S(x, m_2)$ is not absolute, they provide a consistent method for selecting models
(REF) and avoid the delicate issue of selecting priors that differ for model
selection and for regular inference (conditional on the model).

¹⁹ Once again, choosing $g = n$ should attract criticism from some Bayesian corners for
being dependent on the sample size, even though it boils down to picking an imaginary sample
(Smith and Spiegelhalter, 1982) size of 1. See Liang et al. (2008) for an alternative approach
setting a hyperprior on $g$.

²⁰ Although a huge literature has been dedicated to partial Bayes factors like fractional
and intrinsic Bayes factors where a part of the dataset is used to make the posterior
distribution well-defined and the remainder addresses the testing question, as started in
Smith and Spiegelhalter (1982), I will not pursue this direction as (a) it is very rarely a
truly Bayesian procedure, i.e. cannot be expressed as a genuine Bayes factor against a pair of
proper prior distributions, and (b) it suffers from facing too many competing variants to be
advocated. See e.g. Berger and Pericchi (2001) or Robert (2001) for a review.
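
The independence from the normalising constant is easy to verify on a toy case. One common form of the Hyvärinen score for a univariate density $m$ is $2\,\partial_x^2 \log m(x) + (\partial_x \log m(x))^2$ (sign and scaling conventions vary across references), which only involves derivatives of $\log m$ and hence ignores any additive constant in $\log m$. The sketch below approximates the derivatives by central finite differences and compares a normalised Gaussian log density with an arbitrarily shifted version; the function names, step size and numerical values are assumptions for the illustration.

```python
import math


def hyvarinen_score(log_density, x: float, h: float = 1e-3) -> float:
    """2 * (log m)''(x) + ((log m)'(x))^2, with central finite differences of step h."""
    f_plus, f_mid, f_minus = log_density(x + h), log_density(x), log_density(x - h)
    first = (f_plus - f_minus) / (2 * h)
    second = (f_plus - 2 * f_mid + f_minus) / h**2
    return 2 * second + first**2


def normal_log_density(mu: float, s: float):
    """Properly normalised log density of N(mu, s^2)."""
    return lambda x: -0.5 * math.log(2 * math.pi * s**2) - (x - mu) ** 2 / (2 * s**2)


def shifted(log_density, constant: float):
    """Same density up to an arbitrary multiplicative constant exp(constant)."""
    return lambda x: log_density(x) + constant


if __name__ == "__main__":
    log_m = normal_log_density(0.0, 2.0)
    x_obs = 1.3  # hypothetical observation
    print("normalised   :", hyvarinen_score(log_m, x_obs))
    print("unnormalised :", hyvarinen_score(shifted(log_m, 5.0), x_obs))
    print("log score gap:", shifted(log_m, 5.0)(x_obs) - log_m(x_obs))
```

The two Hyvärinen scores coincide up to rounding while the log scores differ by the arbitrary constant, which is why such scores remain defined when the prior predictive stems from an improper prior.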

5. Reflections 

The appeal of great paradoxes²¹ is to exhibit foundational issues in a field, either
to reinforce the arguments in favour of a given theory or, on the contrary, to
cast serious doubts on its validity. The fact that the Jeffreys-Lindley's paradox
is still discussed in papers (as exemplified by the recent Spanos, 2013) and blogs,
by statisticians and non-statisticians alike, is a testimony to its impact on the
debate about the very nature of (statistical) testing. The irrevocable opposition
between frequentist and Bayesian approaches to testing, but also the persistent
impact of the prior modelling in this case, are fundamental questions that have
not yet met with definitive answers. And they presumably never will for, as
put by Lad (2003), "the weight of Lindley's paradoxical result (...) burdens
proponents of the Bayesian practice". However, this is a burden with highly
positive features in that it paradoxically (!) drives the field to higher grounds. 